Fuzzy Clustering based on Semantic Body and its Application in Chinese Spam Filtering
Abstract
The text is the main body of an e-mail. Its content is reflected by a semantic body formed from a large number of semantic elements, so studying the semantic body information of spam is the most authoritative and effective way to analyze its text. First, this paper takes advantage of HowNet's analysis of semantic elements to analyze the semantic bodies in e-mail text, and proposes a method for constructing semantic bodies together with a way of calculating the similarity between semantic bodies based on sentence similarity. Second, to address the imprecision and fuzziness in current spam filtering technology, we use fuzzy clustering. Combining fuzzy clustering with the semantic body, the paper proposes a fuzzy clustering method based on the semantic body. Unlike traditional methods, the proposed method uses the semantic body as the object to be classified and the similarity between semantic bodies as the similarity coefficient. This reduces the dimensionality of the text clustering problem when fuzzy clustering is applied. Finally, we apply the new method to spam filtering. The experimental results show that the method determines e-mail content more objectively than traditional e-mail filtering based on semantic units. It also achieves a much better recall rate when identifying spam whose meaning is expressed unclearly.
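The abstract does not spell out the clustering procedure, so the following is only a minimal sketch of one common way to cluster objects from a pairwise similarity coefficient: take the max-min transitive closure of the fuzzy similarity matrix and cut it at a threshold. The choice of this closure-based technique, the function names, the matrix `sim`, and the threshold `lam` are all assumptions made for illustration; the similarity values are invented and are not results from the paper.

```python
from typing import List

Matrix = List[List[float]]


def max_min_compose(a: Matrix, b: Matrix) -> Matrix:
    """Max-min composition of two square fuzzy relations."""
    n = len(a)
    return [
        [max(min(a[i][k], b[k][j]) for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def transitive_closure(r: Matrix) -> Matrix:
    """Square a reflexive, symmetric fuzzy relation until it stops changing."""
    while True:
        squared = max_min_compose(r, r)
        if squared == r:
            return r
        r = squared


def lambda_cut_clusters(closure: Matrix, lam: float) -> List[List[int]]:
    """Group indices whose closure similarity is at least lam."""
    n = len(closure)
    assigned = [False] * n
    clusters = []
    for i in range(n):
        if assigned[i]:
            continue
        members = [j for j in range(n) if closure[i][j] >= lam]
        for j in members:
            assigned[j] = True
        clusters.append(members)
    return clusters


# Hypothetical similarities between four semantic bodies (illustrative only).
sim = [
    [1.0, 0.8, 0.2, 0.1],
    [0.8, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.7],
    [0.1, 0.2, 0.7, 1.0],
]

closure = transitive_closure(sim)
clusters = lambda_cut_clusters(closure, lam=0.5)
print(clusters)
```

With these invented values, a threshold of 0.5 separates bodies 0 and 1 from bodies 2 and 3, while lowering it to 0.3 merges all four into one cluster. Because the matrix has one row per semantic body rather than one column per word, its size depends on the number of bodies, which is one way to read the dimensionality reduction the abstract mentions.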
Similar resources
A Novel Method of Text Clustering for Chinese Spam Based on Semantic Body
Statistics-based spam filtering methods perform poorly on new types of spam that use synonym substitution and camouflage. A new text clustering method based on Semantic Body is therefore proposed for filtering Chinese spam. In this paper, word sense disambiguation, lexical chains based on HowNet, and statistics-based TF-IDF are adopted to extract features of mails. The Semantic B...
Applications of Text Clustering Based on Semantic Body for Chinese Spam Filtering
Statistics-based spam filtering methods do not perform well enough on new types of spam that use synonym substitution and camouflage, because they ignore the semantic relations between words in the text and judge only from the words themselves. A spam filtering method based on the semantic body is therefore proposed in this paper. The method adopts lexical c...
Use of Semantic Similarity and Web Usage Mining to Alleviate the Drawbacks of User-Based Collaborative Filtering Recommender Systems
One of the best-known recommendation methods is user-based Collaborative Filtering (CF). This system compares the active user's item ratings with the historical rating records of other users to find similar users, and recommends items that seem interesting to these similar users and have not been rated by the active user. As a way of computing recommendations, the ultimate goal of the user-ba...
A Fuzzy C-means Algorithm for Clustering Fuzzy Data and Its Application in Clustering Incomplete Data
The fuzzy c-means clustering algorithm is a useful tool for clustering; but it is convenient only for crisp complete data. In this article, an enhancement of the algorithm is proposed which is suitable for clustering trapezoidal fuzzy data. A linear ranking function is used to define a distance for trapezoidal fuzzy data. Then, as an application, a method based on the proposed algorithm is pres...
Application of Refined LSA and MD5 Algorithms in Spam Filtering
The paper proposes a spam filtering method that uses integrated and refined Latent Semantic Analysis (LSA) and Message-Digest Algorithm 5 (MD5) algorithms to address a series of universal problems in spam filtering, including remarkably lowered filtering precision and notably unbalanced filtering efficiency as a result of lack of latent semantic analysis of mail contents. In introducing LSA, it...